Abstract


To accurately and comprehensively monitor a program’s behavior, many performance measurement
tools must transform the program’s executable representation or binary. By instrumenting binary programs
to monitor program events, tools can precisely analyze compiler optimization effectiveness, memory system
performance, pipeline interlocking, and other dynamic program characteristics that are fully exposed only at
this level. Binary transformation has also been used to support software-enforced fault isolation, debugging,
machine re-targeting and machine-dependent optimization.

At present, binary transformation applications face a difficult trade-off. Previous approaches to
implementing robust transformations incur significant disk space and run-time overhead. To improve efficiency,
some current systems sacrifice robustness, relying on heuristic assumptions about the program and
recognition of complex, compiler-dependent code generation idioms. In this paper we present adaptable binaries, a
technique for implementing robust, efficient, and compiler-independent binary transformations.

We evaluated a prototype implementation of adaptable binaries under the Ultrix 4.2 operating
system and the MIPS processor architecture. Using the C SPEC92 benchmarks, we assessed adaptable
binaries in three ways. First, we demonstrated that the information necessary to build adaptable binaries
can be compactly recorded, increasing space overhead by only 9% for the SPEC92 benchmarks. Second,
we measured the run-time overhead of previous approaches to implementing robust binary transformations,
and showed that adaptable binaries significantly reduce this overhead. Finally, we measured the run-time
transformation overhead of two user applications, pixie and MemSpy. For our benchmark programs, using
adaptable binaries eliminates pixie’s 110% average transformation overhead and reduces MemSpy’s average
overhead from 1296% to 33%.


1  Introduction 


Program development and monitoring tools are frequently implemented in two parts. The first part
instruments a target program, transforming that program to monitor its behavior. The transformed program
interacts with the second part of the monitoring tool, a run-time library that records information and takes
action in response to events in the transformed program. To accurately and comprehensively monitor a
program’s behavior, many monitoring tools must instrument the program’s executable representation or
binary. For example, to measure memory system performance, a profiling tool must be aware of register
spilling decisions made by the high-level language compiler, and instruction organization decisions made by
the linker. At the binary level, all such decisions have been resolved. By transforming binary programs
to monitor program events, tools can precisely analyze compiler optimization effectiveness, memory system
performance, pipeline interlocking, and other dynamic program characteristics that are fully exposed only
at this level.

Instrumentation of programs at the binary level also simplifies the engineering of program monitoring tools.
Binary instrumentation can be made compiler-independent, facilitating the analysis of programs written in
different high-level languages. Applying transformations to the binary eliminates the complexity and cost of
recompilation. Further, by applying transformations to the binary rather than higher-level representations,
instrumentation code cannot alter compilation decisions. Finally, binary transformation can be applied to
code, such as system libraries, for which source is typically unavailable.

For these reasons, a number of performance analysis tools are implemented as binary transformations [BL92,
BKW90, Digc, GH90, Wal91]. For example, pixie calculates instruction frequencies and floating point
interlocks based on profile data and the instruction sequence of each basic block [Digc]. QPT uses control flow
analysis to support collection and compression of address traces [BL92]. To aid programmers in identifying
memory hierarchy bottlenecks, MTool compares actual and estimated execution times for different regions of
the program [GH90]. Borg, Kessler, and Wall use binary transformation to generate and analyze very long
multi-program address traces [BKW90].

Binary transformation has also been used to support software-enforced fault isolation, debugging, machine
re-targeting and machine-dependent optimization. By inserting code to efficiently monitor indirect control
transfers and memory updates, software-enforced fault isolation ensures that program errors in one
module do not corrupt data in other modules [WLAG93]. By applying this transformation at the binary
level, the fault isolation system eliminates the need for a trusted compiler. Instrumented memory references
have been used to implement data breakpoints, detect memory leaks, and trap reads of uninitialized
data [HJ92, Cen, WLG93]. Several systems have used binary transformation to re-target a program to a
new architecture [HB89, SCK+93, Ech92, BKKM87]. Finally, optimizations such as code motion, dead-code
elimination, register allocation, and instruction scheduling have been applied to binary programs, exploiting
the global information available at the binary level [Joh90, SW92, Wal86, Wal92].

At present, binary transformation applications face a difficult trade-off. Previous approaches to implementing
robust transformations incur significant disk space and run-time overhead. To improve efficiency, some
current systems sacrifice robustness, relying on heuristic assumptions about the program’s control flow and
register usage and recognition of complex, compiler-dependent code generation idioms. This reliance on
heuristic information limits the scope and effectiveness of a binary transformation application.

In this paper we present adaptable binaries (AB), a technique for implementing robust, efficient, and
compiler-independent binary transformations. Adaptable binaries support three classes of binary transformation
operations. First, control operations allow transformation tools to distinguish code from data, identify basic
blocks, and identify targets of indirect control transfer instructions. Second, transformation tools can use
edit operations to insert, delete, and reorder machine instructions. Finally, register operations make registers
available for use by inserted code.

We have implemented and evaluated adaptable binaries under the Ultrix 4.2 operating system and the MIPS
architecture. Using the C SPEC92 benchmarks, we evaluated adaptable binaries in three ways. First, we
measured the additional disk space required by adaptable binaries. We show that the information necessary
to support adaptable binaries can be compactly recorded, increasing space overhead by only 9% for the
SPEC92 benchmarks. For comparison, the average space overhead of the standard symbol table included in
all unstripped executable files is 81%. Debugging information increases executable file size by an average of
123%.

Second, we measured the run-time overhead of previous approaches to implementing robust binary
transformations, and showed that adaptable binaries significantly reduce this overhead. These measurements of the
basic binary transformation operations show that adaptable binaries can significantly improve the
performance of many binary transformation applications. For example, one of the main sources of inefficiency in
DEC’s VAX-to-Alpha binary translation is the use of machine emulation in cases where control is transferred
to an unknown target instruction. Adaptable binaries eliminate the need for such emulation.

Finally, we measured the run-time transformation overhead of two applications, pixie and MemSpy [Digc,
MGA92]. Pixie is a simple but widely-used performance analysis tool that counts basic blocks. For our
benchmark programs, relative to the original execution time of the program, we were able to eliminate
pixie’s 110% average transformation overhead. MemSpy is a sophisticated analysis tool for precisely
identifying memory hierarchy bottlenecks. Taking advantage of the information available in adaptable binaries
reduced its average run-time transformation overhead from 1296% to 33%.

The rest of the paper is organized as follows. Section 2 defines the operations supported by adaptable
binaries and discusses why they cannot be implemented efficiently under existing systems. Section 3 details
the information and program analysis required to build adaptable binaries and discusses our prototype
implementation. Section 4 quantitatively evaluates the space requirements and run-time performance of
different transformation strategies. Section 5 surveys the related work in this area.


2  Binary  Transformation  Operations 

In this section we detail the fundamental implementation issues that complicate the support of control,
edit, and register operations on binary programs. We define each set of operations and outline existing
strategies for their support. We demonstrate that, because existing binaries lack crucial control, relocation
and register information, these strategies must rely on conservative assumptions about the program and must
resort to emulation when static analysis fails. In the next section we define adaptable binaries, our solution
to these issues. Adaptable binaries contain the minimum information required for efficient and
robust support of binary transformation.

2.1  Control  Operations 

Control operations provide program control flow information to binary transformation tools. For example,
pixie counts basic blocks to obtain profiling information. To identify basic block boundaries, pixie
necessarily relies on heuristics, rendering it inaccurate in some cases. Similarly, data breakpoint systems can use
data flow analysis to reduce the number of memory update instructions that require monitoring. To make
such analysis effective, the breakpoint system requires an accurate control flow graph [WLG93]. Finally,
site-specific optimization techniques, such as instruction scheduling based on profiling information [SW92],
also require control flow information.

The fundamental step in implementing control operations is the resolution of indirect control transfers.
Most control transfer instructions have statically resolvable targets; however, the target of an indirect control
transfer instruction is specified via a register and can only be determined at run-time. Common programming
language abstractions such as function pointers, case statements, and continuations are typically implemented
using indirect control transfer instructions. Without adaptable binaries, the presence of indirect control
transfers creates two significant problems. First, the utility of control flow graphs is extremely restricted.
Because there are no constraints on the targets of indirect control transfers, any instruction in the program
must be assumed to be a possible control transfer target. Applications that use control flow graph analysis to
reduce instrumentation overhead must rely on compiler-dependent heuristics to guess basic block boundaries,
accepting reduced accuracy and portability in exchange for greater efficiency.

Second, without the control information present in adaptable binaries, binary transformation applications
cannot reliably distinguish code from data. A number of compilation environments place data in the binary’s
code segment. Under some object formats, since only the code segment is made read-only, it is a natural
place for read-only data such as program constants. Unfortunately, many systems continue this practice even
under object formats that provide read-only data segments. In the presence of indirect control transfers,
there is no robust method for distinguishing this data from code. If data, mistaken for code, is instrumented,
the transformed program will be incorrect. This problem is exacerbated on architectures with variable length
instructions; code sequences are not constrained to begin on word boundaries, making disassembly dependent
on the assumed starting point of the code.

One solution to the problem of distinguishing code from data, first employed by pixie, is to duplicate the
code segment, and instrument only the duplicated code segment. The duplicated code segment is assumed
to contain only code and thus some data may be instrumented. All load and store addresses are unaltered;
hence, the data in the original code segment is used. The disadvantage of this approach is that it doubles
the size of the transformed program.

2.2  Edit  Operations 

Edit operations are used to insert, delete, and reorder machine instructions. These operations are
fundamental to all binary transformation applications, since by definition such applications modify the program
to alter or monitor its behavior. Edit operations may change the addresses of instructions and data objects,
requiring that all references to these objects be updated. Computed addresses and instruction addresses
stored in the data segment cannot be reliably identified and updated statically. For example, a common
implementation strategy for case statements is to use a jump table stored in the data segment. The jump
table stores the code address for each arm of the case statement. There is no reliable way to distinguish a
jump table from other kinds of data.

A simple solution to this problem, employed by a number of applications [Digc, LB92, Wal91], is to
dynamically relocate affected references. Consider a transformation application that only changes the location
of instructions. Control transfer instructions whose targets are statically resolvable can be updated during
transformation. Indirect control transfers must be dynamically relocated using a translation table, built at
transformation time, that maps old addresses to new addresses. Like code duplication, this technique doubles
the size of the binary program, because the translation table must be the same size as the code segment.
Binary transformation systems can also apply dynamic relocation to loads and stores if the locations of data
objects are changed.
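
As a rough illustration of dynamic relocation, the following Python sketch models the translation table and the lookup performed before each indirect control transfer. It is not the code used by pixie or any other cited tool; the function names, the 4-byte instruction size, and the toy addresses are assumptions made for the example.

```python
WORD = 4  # assumed fixed instruction size in bytes (MIPS-like)


def build_translation_table(code_start, old_to_new):
    """Build a table with one entry per original instruction word.

    old_to_new maps each original instruction address to its address in
    the transformed program. Entry i holds the new address of the
    instruction originally located at code_start + i * WORD.
    """
    table = [None] * len(old_to_new)
    for old, new in old_to_new.items():
        table[(old - code_start) // WORD] = new
    return table


def relocate_indirect_target(table, code_start, old_target):
    """Run-time lookup performed before an indirect control transfer."""
    return table[(old_target - code_start) // WORD]


# Toy program of four instructions; two instrumentation words were
# inserted before the instruction originally at 0x1008 (hypothetical).
code_start = 0x1000
old_to_new = {0x1000: 0x2000, 0x1004: 0x2004, 0x1008: 0x2010, 0x100C: 0x2014}
table = build_translation_table(code_start, old_to_new)
new_target = relocate_indirect_target(table, code_start, 0x1008)
```

Because the table holds one entry per original instruction word, it is as large as the code segment it describes, which is the space cost noted above.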

As  an  alternative  to  directly  inserting  instrumentation  code,  a  binary  transformation  tool  can  use  out-of-line 
insertion  to  transfer  control  to  instrumentation  code  without  altering  existing  instruction  addresses  [Kes90]. 
To  logically  insert  instrumentation  code  before  a  particular  instruction,  the  instruction  is  replaced  with  a 
control  transfer  instruction  to  the  instrumentation  code.  Before  returning  to  the  original  instruction  stream, 
the  displaced  instruction  is  executed.  Figure  2  depicts  a  simple  example  of  out-of-line  insertion. 
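
Out-of-line insertion can be modeled with a toy instruction list. The sketch below is only an illustration under simplifying assumptions: instructions are plain strings, addresses are list indices, and branch delay slots are ignored. The names `insert_out_of_line` and `trace` are hypothetical.

```python
def insert_out_of_line(program, index, instrumentation):
    """Return (patched_program, patches) without moving any instruction.

    The instruction at `index` is replaced by a jump to a patch that holds
    the instrumentation, the displaced instruction, and a jump back to the
    instruction following the insertion point.
    """
    label = f"patch_{index}"
    patched = list(program)
    displaced = patched[index]
    patched[index] = f"jump {label}"
    patches = {label: list(instrumentation) + [displaced, f"jump {index + 1}"]}
    return patched, patches


def trace(program, patches):
    """List the instructions executed by the toy program, following patches."""
    executed, pc = [], 0
    while pc < len(program):
        instr = program[pc]
        if instr.startswith("jump patch_"):
            body = patches[instr.split()[1]]
            executed.extend(body[:-1])      # instrumentation, then displaced instruction
            pc = int(body[-1].split()[1])   # return to the original instruction stream
        else:
            executed.append(instr)
            pc += 1
    return executed


program = ["add", "load", "store"]
patched, patches = insert_out_of_line(program, 1, ["count_load"])
executed = trace(patched, patches)
```

In this model the patched program has the same length as the original, so no existing instruction address changes; the cost is the two extra control transfers executed at every instrumentation point.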

2.3  Register  Operations 

Instructions inserted by a binary transformation tool may need to use machine registers to compute
intermediate values or memory addresses. We call such registers temporary registers because they are not live across
instructions from the original program. Some applications, such as the memory system simulation program
MemSpy, require a large number of temporary registers [MGA92]. In addition, many binary transformation
applications can benefit from registers reserved for their exclusive use. For example, pixie uses a reserved
register to hold the base of its dynamic relocation table. Wahbe, Lucco, and Graham show that reserving
registers for a data breakpoint facility can significantly reduce overhead [WLG93].

Temporary registers are simple to obtain. On entry to the instrumentation code the required registers are
saved; these registers are restored upon exit from the instrumentation code. Unfortunately, this solution
incurs significant run-time overhead. To reduce the run-time penalty of obtaining temporary registers, a
binary transformation tool can use live register information to avoid saving and restoring registers whose
values are no longer needed by the original program.

However, calculating live register information requires an accurate control flow graph; as stated above,
current systems cannot reliably build a control flow graph. Further, without interprocedural register usage
information, transformation applications must assume that all registers not mentioned in the procedure are
live, that procedure calls use and define all registers, and that all registers are live on exit from the procedure.
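
The effect of live register information on temporary register allocation can be sketched as follows. This is a simplified model, not the allocation policy of any cited tool; the function name `pick_temporaries` and the register names are assumptions for illustration.

```python
def pick_temporaries(count, all_registers, live):
    """Choose `count` temporaries, preferring registers that are dead here.

    Returns (chosen, must_save). Only chosen registers that are live need
    to be saved on entry to the instrumentation code and restored on exit.
    """
    dead = [r for r in all_registers if r not in live]
    still_live = [r for r in all_registers if r in live]
    chosen = (dead + still_live)[:count]
    must_save = [r for r in chosen if r in live]
    return chosen, must_save


regs = ["t0", "t1", "t2", "t3"]

# With precise liveness, two dead registers are available for free.
precise = pick_temporaries(2, regs, live={"t0", "t1"})

# With the conservative assumption that every register is live,
# both temporaries must be saved and restored.
conservative = pick_temporaries(2, regs, live=set(regs))
```

Under the conservative assumptions described above, every instrumentation point pays for a save and a restore per temporary, whereas precise liveness often removes that cost entirely.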

Allocating reserved registers also introduces significant run-time overhead. All uses of the reserved register
in the original program must be removed. This requires either obtaining other free registers or mapping the
reserved register to a fixed memory location. Because programs can manipulate the stack in unpredictable
ways (such as allocating space using the C library call alloca), the stack cannot be used to hold register
values, further complicating reserved register allocation.


3  Building  Adaptable  Binaries 

In  this  section,  we  define  the  minimum  information  required  for  efficient  and  robust  support  of  control,  edit, 
and  register  operations  on  binary  programs.  An  adaptable  binary  is  any  executable  program  representation 
that  contains  this  information.  We  also  outline  the  implementation  of  our  adaptable  binary  system,  which 
uses  this  information  to  support  binary  transformation. 

3.1  Adaptable  Binary  Information 

The  necessary  information  falls  into  three  categories.  Control  information  provides  a  control  flow  graph  for 
each  procedure,  supporting  control  operations  such  as  the  identification  of  basic  block  boundaries.  Relocation 
information  supports  editing  operations  by  allowing  all  references  to  instruction  and  data  objects  to  be 
statically  updated.  Finally,  register  usage  information  supports  live  register  analysis,  significantly  improving 
the  performance  of  register  operations. 

This information cannot be reliably derived from current binary programs. Fortunately, the necessary
information is readily available in any compiler and can be compactly recorded using conventional binary
symbol tables. To conserve space, only information that is impossible to derive is stored in the binary;
our adaptable binary system uses program analysis to reconstruct complete control, relocation, and register
usage information.

3.2  Control  Information 

The information needed for control operations can be synthesized by constructing a control flow graph for
each procedure. Three components of the control flow graph cannot be reliably derived and are maintained
in the adaptable binary:

•  The  beginning  and  ending  address  for  each  procedure. 

•  Entry  addresses  for  each  procedure. 

•  The  possible  targets  of  indirect  control  transfers. 

Indirect control transfer targets are specified using target groups. Target groups are named sets of addresses;
an address may belong to any number of target groups, and target groups need not be unique. Target groups
are used for two reasons. First, because many indirect control transfers have the same set of possible target
addresses (e.g. procedure returns), target groups provide considerable space savings. Second, target groups
are used to support separate compilation. The targets for certain classes of indirect control transfers, for
example exception handling, might reside in files not yet processed by the compiler. Because new addresses
can be added to target groups at any time, they provide a convenient level of indirection in naming target
addresses.
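
A target group table can be pictured as a mapping from group names to address sets, with each indirect control transfer site naming a group rather than listing its targets. The Python sketch below is an assumed in-memory model, not the on-disk encoding used by our prototype; the class name, method names, and addresses are hypothetical.

```python
from collections import defaultdict


class TargetGroups:
    """Named sets of addresses shared by indirect control transfer sites."""

    def __init__(self):
        self.groups = defaultdict(set)  # group name -> set of target addresses
        self.sites = {}                 # address of indirect transfer -> group name

    def add_target(self, group, address):
        """Add an address to a group; allowed at any time."""
        self.groups[group].add(address)

    def bind_site(self, site, group):
        """Record that the indirect transfer at `site` may reach `group`."""
        self.sites[site] = group

    def targets_of(self, site):
        """Possible targets of the indirect transfer at `site`."""
        return set(self.groups[self.sites[site]])


tg = TargetGroups()
tg.add_target("return_points", 0x100)
tg.bind_site(0x40, "return_points")   # two sites share one group,
tg.bind_site(0x80, "return_points")   # so the set is stored only once
tg.add_target("return_points", 0x200) # added later, e.g. from another file
```

Because sites refer to groups by name, the address added last is visible to both sites without revisiting them, which is the level of indirection that supports separate compilation.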

Given the above control information, our adaptable binary system locates basic blocks using the following
algorithm:

1. Initialize the work list to the entry address of the program¹. Repeat steps 2 and 3 until the work list is empty.

2. Remove an address, address, from the work list and create a corresponding basic block. Add instructions
to the current basic block, beginning at address, until either a control flow instruction is encountered
or there are no more instructions.

3. Given a control flow instruction, if any of the target addresses have not been processed as in Step 2,
add the addresses to the work list.

Unlike  conventional  algorithms  that  linearly  process  the  program  to  discover  basic  blocks  [ASU86],  the  above 
algorithm  guarantees  that  no  data  is  inadvertently  processed  as  code.  Unreachable  areas  in  the  code  segment 
are  simply  treated  as  data.  Given  the  basic  blocks,  the  next  step  is  to  build  the  control  flow  graph. 
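
The work list algorithm above can be sketched in Python as follows. This is a minimal model under stated assumptions: instructions have a fixed 4-byte size, delay slots are ignored, the `instructions` mapping and its `(mnemonic, targets)` layout are hypothetical, and a block is not split if a later target falls in its middle.

```python
WORD = 4  # assumed fixed instruction size in bytes


def find_basic_blocks(instructions, entry):
    """Discover the basic blocks reachable from `entry`.

    instructions maps address -> (mnemonic, targets). For a control flow
    instruction, targets lists every possible successor (the fall-through
    of a conditional branch, or the members of an indirect transfer's
    target group); for any other word, targets is None.
    """
    blocks = {}                      # block start address -> list of addresses
    work = [entry]                   # step 1
    while work:                      # repeat steps 2 and 3 until empty
        start = work.pop()
        if start in blocks:
            continue                 # already processed
        block, addr = [], start
        while addr in instructions:  # step 2: extend the current block
            block.append(addr)
            _, targets = instructions[addr]
            if targets is not None:  # control flow instruction ends the block
                work.extend(t for t in targets if t not in blocks)  # step 3
                break
            addr += WORD
        blocks[start] = block
    return blocks


# Toy code segment with a data word embedded at 0x10 (hypothetical).
segment = {
    0x00: ("add", None),
    0x04: ("beq", [0x14, 0x08]),  # branch target and fall-through
    0x08: ("sub", None),
    0x0C: ("jr", [0x14]),         # indirect jump; targets from its target group
    0x10: (".word", None),        # data, never reached
    0x14: ("ret", []),
}
blocks = find_basic_blocks(segment, 0x00)
visited = {addr for block in blocks.values() for addr in block}
```

In this example the word at 0x10 is never visited, so it is treated as data rather than disassembled as an instruction.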

On architectures without delayed control transfers, building the control flow graph is a simple task [ASU86].
In the presence of branch delay slots, especially annulled delay slots, building the control flow graph is
more difficult than for higher-level language programs, but still straightforward given the adaptable binary
information [LB92].

3.3  Relocation  Information 

Edit operations, such as inserting instrumentation code, can change the address of instructions and data
objects. Adaptable binaries provide information to update these references at transformation time,
eliminating the need for techniques that incur run-time overhead, such as dynamic relocation or out-of-line code
insertion.

Traditional linkers combine one or more object files into a single executable, relying on relocation information
to locate and update all program references. Each relocation entry consists of a pointer to the affected
reference and the operations required to update it. Traditionally, the linker discards the relocation information
following creation of the binary.

Adaptable binaries retain this standard inter-file relocation information and extend it with intra-file
relocation information. In rare cases, inter-file relocation does not support changing the relative order of
instructions within a file. Consider the indirect control transfer depicted in Figure 3. Because conventional
inter-file relocation information assumes that the relative placement of instructions within a file remains
unchanged, support for updating the reference codeLabel is provided, but not, typically, the constant intra-file
offset -16. If instrumentation code is inserted between codeLabel and codeLabel-16, the constant offset
must be adjusted through intra-file relocation information.
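
The adjustment of a constant intra-file offset can be illustrated with a small calculation. The sketch below assumes a single insertion of `inserted_bytes` at address `insert_at`, which shifts every original instruction at or beyond that address; the function names and the example addresses are hypothetical.

```python
def moved(address, insert_at, inserted_bytes):
    """New address of an original instruction after inserting code at insert_at."""
    return address + inserted_bytes if address >= insert_at else address


def adjust_offset(label, offset, insert_at, inserted_bytes):
    """Recompute a constant offset so label + offset names the same instruction.

    Returns (new_label_address, new_offset).
    """
    old_target = label + offset
    new_label = moved(label, insert_at, inserted_bytes)
    new_target = moved(old_target, insert_at, inserted_bytes)
    return new_label, new_target - new_label


# codeLabel at 0x1040 with offset -16 refers to the instruction at 0x1030.
# Inserting 8 bytes between the two moves only the label.
between = adjust_offset(0x1040, -16, insert_at=0x1038, inserted_bytes=8)

# Inserting 8 bytes before both moves them together; the offset is unchanged.
before = adjust_offset(0x1040, -16, insert_at=0x1000, inserted_bytes=8)
```

Only the first case needs intra-file relocation information: inter-file relocation alone would update the label but leave the stale offset of -16 in place.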

3.4  Register  Usage  Information 

As  outlined  in  Section  2,  register  operations  that  allocate  temporary  and  reserved  registers  can  benefit  from 
live  register  information.  For  each  procedure,  adaptable  binaries  store  the  following  information: 

• livereg-gen: Registers whose value is used in the procedure.

• livereg-kill: Registers whose value is defined in the procedure.

• livereg-live-on-exit: Registers whose value is live on exit from the procedure.

¹ Since the operating system requires the entry address to begin program execution, we can safely assume that it is specified
in all binaries.

In the presence of callee-saved registers, livereg-gen and livereg-kill cannot be precisely derived. Consider
the following function:

function f()
    Memory ← reg      /* save callee-saved register before using it */
    ... function body ...
    reg ← Memory      /* restore callee-saved register before returning */


If there are assignments to memory in f()’s body, conventional program analysis would conclude that f()
both uses and defines the value in reg. With respect to the call site, however, the value of reg is preserved
across calls to f(), and is thus semantically neither used nor defined. When the compiler emits this register
spill code, it can easily and precisely construct livereg-gen and livereg-kill to reflect this. Because highly
optimized programs can violate standard calling conventions, without precise livereg-gen and livereg-kill
information, a transformation application using reg would be forced to unnecessarily spill the register before
calling f().
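
The difference between the conventional result and the caller-visible result can be sketched as follows. The model is deliberately simple: each instruction is a pair of register sets (read, written), and the set of registers the compiler knows it saved and restored is supplied directly. The function names and register names are assumptions for illustration, not part of the adaptable binary format.

```python
def naive_gen_kill(instructions):
    """Conventional analysis: registers read before written, and registers written."""
    gen, kill = set(), set()
    for reads, writes in instructions:
        gen |= set(reads) - kill
        kill |= set(writes)
    return gen, kill


def caller_visible_gen_kill(instructions, saved_and_restored):
    """Drop registers the compiler knows are preserved across the call."""
    gen, kill = naive_gen_kill(instructions)
    return gen - saved_and_restored, kill - saved_and_restored


# f() spills callee-saved register s0, uses it, then restores it.
f_body = [
    (["s0"], []),      # Memory <- s0   (save)
    (["a0"], ["s0"]),  # s0 <- a0       (function body)
    (["s0"], ["v0"]),  # v0 <- s0
    ([], ["s0"]),      # s0 <- Memory   (restore)
]
naive = naive_gen_kill(f_body)
precise = caller_visible_gen_kill(f_body, saved_and_restored={"s0"})
```

The naive sets report s0 as both used and defined, while the caller-visible sets omit it, so instrumentation that keeps a value in s0 need not spill it around a call to f().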

3.5  Our  Adaptable  Binary  System 

We  have  built  a  prototype  adaptable  binary  system  (ABS).  The  system  supports  both  ecoff  and  a.out  binary 
formats  and  can  read  OSF/1,  SunOS,  and  Ultrix  binaries.  Most  of  its  code  is  independent  of  the  target 
architecture  and  binary  format.  At  present,  it  only  includes  disassembly  modules  for  the  MIPS  and  Sparc 
architectures. 

We  modified  gcc  to  output  adaptable  binary  (AB)  information.  Because  the  AB  information  was  readily 
available  in  gcc  data  structures,  adding  this  functionality  to  gcc  required  less  than  800  lines  of  C  code  and 
less  than  a  week’s  work. 

In contrast, the ABS required several months of implementation and is over 10000 lines of C code. Most of the
complexity of this implementation is in synthesizing full control, relocation, and register usage information
from the AB information and in providing machine-independent abstractions for binary transformation
operations. This machine-independent layer supports the construction of portable binary transformation
applications.

In addition, we are currently modifying the Orbit T compiler to output AB information. This compiler
makes considerable use of indirect control transfer instructions in implementing continuations [KKR+86].
At present, the compiler outputs the minority of the necessary AB information.


4  Evaluation 

To quantify the impact of adaptable binaries on binary transformation applications, we performed two types
of experiments. First, we measured the disk space overhead incurred by adaptable binaries. Second, we
measured the run-time transformation overhead of supporting two transformation operations: instruction
insertion and obtaining temporary registers. Transformation overhead is the amount of time spent executing
instructions that support the instrumentation code. For example, instructions that save and restore registers
in order to obtain them for instrumentation purposes are counted as contributing to the transformation
overhead. Neither the time spent in the original program nor the time spent in instrumentation code is
part of the transformation overhead. Hence, transformation overhead isolates the run-time cost of supporting
binary transformation operations. We compare the overhead of programs transformed using previously
proposed techniques to the overhead of programs transformed using our AB system. In all cases, we report
transformation overheads relative to the original program’s execution time.
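
The definition above amounts to a simple calculation, sketched here in Python. The function name and the timing values are hypothetical and serve only to show how the reported percentages relate to measured times.

```python
def transformation_overhead(total, original, instrumentation):
    """Transformation overhead as a percentage of the original execution time.

    total           -- execution time of the transformed program
    original        -- execution time of the original program
    instrumentation -- time spent in the instrumentation code itself
    The remainder is time spent supporting the instrumentation, such as
    saving and restoring registers or translating jump targets.
    """
    support = total - original - instrumentation
    return 100.0 * support / original


# Hypothetical timings in seconds: 110 s of support time on a 100 s program.
example = transformation_overhead(total=250.0, original=100.0, instrumentation=40.0)
```

An overhead of 110% in this sense means the supporting instructions alone cost slightly more than running the original program.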

In addition to measuring these basic transformation operations in isolation, we analyzed the operations
required by two binary transformation applications: pixie and MemSpy. We measured the transformation
overhead of these operations as implemented by the original programs. We then re-implemented these
operations to take advantage of AB information and measured their transformation overhead. Our
experiments used the C SPEC92 benchmarks, and were performed on a DECstation 5000/240 with 32 megabytes
of memory.

4.1  Space  Overhead 

Table 1 presents the disk space overhead of the control, relocation, and register usage information required by
adaptable binaries. Our ABS stores adaptable binary information using the conventional binary symbol table
and compresses this data using a variation of Ziv-Lempel compression [ZL77]. We found that using a standard
compression algorithm rather than semantic compression allowed simplified layouts of the adaptable binary
information and yielded similar disk space savings. For comparison, we also measured the space overhead
of the standard symbol table included in all unstripped executables and the symbol information required
during debugging.

4.2  Individual  Operations 

Control  operations  do  not  directly  introduce  run-time  overhead  although  the  lack  of  control  operations  can 
preclude  optimization  of  instrumented  code.  For  example,  QPT’s  use  of  the  control  flow  graph  reduces  the 
number  of  basic  block  counters  by  a  factor  of  two  and  the  number  of  counter  increments  by  up  to  a  factor 
of  four  [BL92].  However,  we  can  directly  measure  the  overhead  of  strategies  for  supporting  edit  and  register 
operations. 


4.2.1  Edit  Operations 

In order to successfully insert instructions into a binary program, a binary transformation tool must correctly
translate indirect control transfer instructions. We measured three strategies for doing this. First, the
transformation tool can use dynamic relocation, prefacing each indirect control transfer site with a table
lookup that translates the old jump target address to the new jump target address. Second, using out-of-line
insertion, the transformation tool can avoid this problem by never moving existing instructions. Third, the
transformation tool can use the relocation information in adaptable binaries to safely update references to
moved instructions. We designate these alternatives DYN-RELOC, OUT-OF-LINE, and AB respectively.

Figure 1 gives the code sequence used to perform dynamic relocation. The extra memory access instructions
are necessary to obtain a scratch register. We inserted this code sequence before every indirect control
transfer in the original program. Table 2 shows that the average performance overhead of this approach was
34.3% on our example programs.

Table 2 also gives the transformation overhead for OUT-OF-LINE for inserting instrumentation code before
every load and store instruction. On average, OUT-OF-LINE incurs 112.5% execution time overhead. The
AB approach incurs no execution time overhead for supporting instruction insertion, since adaptable binaries
contain the necessary information to statically update references to moved instructions.
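The static update can be sketched as follows, under a deliberately simplified model: instructions are fixed-width words, instrumentation of uniform size is inserted before selected instructions, and the relocation information is reduced to a list of data slots known to hold code addresses. The names `make_translator`, `patch_code_addresses`, and `reloc_slots` are hypothetical and do not come from the AB system.

```python
from bisect import bisect_left

WORD = 4  # bytes per instruction (fixed-width, as on MIPS)


def make_translator(insert_points, words_inserted):
    """Build an old-offset -> new-offset function for a rewritten code segment.

    insert_points: sorted old byte offsets that receive instrumentation
                   immediately before them.
    words_inserted: instructions inserted at each point (assumed uniform).
    """
    def translate(old_offset):
        # Insertions strictly before old_offset push it down; an insertion at
        # old_offset itself does not, so jumps land on the instrumentation.
        earlier = bisect_left(insert_points, old_offset)
        return old_offset + earlier * words_inserted * WORD
    return translate


def patch_code_addresses(data_words, reloc_slots, translate):
    """Rewrite only the slots that relocation information marks as code addresses."""
    patched = list(data_words)
    for slot in reloc_slots:
        patched[slot] = translate(data_words[slot])
    return patched


# Hypothetical example: a three-entry jump table followed by an ordinary integer.
translate = make_translator(insert_points=[4, 12], words_inserted=2)
data = [0, 8, 16, 8]  # the last word is plain data that happens to equal 8
patched = patch_code_addresses(data, reloc_slots=[0, 1, 2], translate=translate)
print(patched)  # [0, 16, 32, 8]
```

The last word shows why relocation information matters: without it, a tool cannot tell a code address from an integer with the same value, and must either guess or fall back on run-time translation.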




    Memory ← temp-reg
        Save temporary register temp-reg.
    temp-reg ← jump-target-address - code-start-address
        Calculate offset of current jump target address.
    temp-reg ← temp-reg + table-base-address
        Add jump target address offset to base of dynamic relocation table.
    temp-reg ← Memory[temp-reg]
        Load new jump target address from relocation table.
    temp-reg ← Memory
        Restore temporary register temp-reg.


Figure 1: Assembly pseudo code for dynamic relocation.
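The lookup in Figure 1 can be mimicked in a few lines. This is a minimal sketch that assumes one table entry per original instruction word and models the table as a Python list; `build_relocation_table` and `translate_jump_target` are hypothetical names, and the saving and restoring of the scratch register is not modeled.

```python
WORD = 4  # bytes per instruction and per table entry (assumption)


def build_relocation_table(code_start, new_addresses):
    """new_addresses[i] is the new address of the i-th original instruction."""
    return {"base": code_start, "entries": list(new_addresses)}


def translate_jump_target(table, jump_target):
    """Run-time lookup performed before each indirect control transfer."""
    offset = jump_target - table["base"]     # offset of the old target
    return table["entries"][offset // WORD]  # load the new target


# Hypothetical four-instruction program with two instructions inserted
# before the third original instruction.
CODE_START = 0x400000
table = build_relocation_table(
    CODE_START, [0x400000, 0x400004, 0x400008, 0x400014]
)
print(hex(translate_jump_target(table, 0x40000C)))  # 0x400014
```

Every indirect transfer pays for this lookup at run time, which is the source of the DYN-RELOC overhead reported in Table 2.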


4.2.2  Register  Operations

We measured the cost of obtaining 2, 4, and 8 temporary registers before every load and store instruction
in our benchmark programs. Because they do not have accurate control flow or register usage information,
transformation tools not using adaptable binaries must save and restore the required registers. With AB
information, a transformation tool can accurately compute live register information and save only the live
registers. For example, Table 3 shows that the execution time overhead of obtaining 4 registers with AB
information is only 30.6%, compared to 157.3% without AB information.
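To make the saving concrete, here is a minimal sketch of a backward liveness pass over a single basic block, followed by scratch register selection that prefers dead registers. It assumes the live-out set is already known from the control flow graph, and the register names, instruction encoding, and helper names are illustrative only.

```python
def live_before_each(block, live_out):
    """Backward liveness over one basic block.

    block: list of (defs, uses) register-name sets, in program order.
    live_out: registers live at block exit (from the control flow graph).
    Returns the set of registers live immediately before each instruction.
    """
    live = set(live_out)
    result = [None] * len(block)
    for i in range(len(block) - 1, -1, -1):
        defs, uses = block[i]
        live = (live - set(defs)) | set(uses)
        result[i] = set(live)
    return result


def pick_scratch(live, candidates, needed):
    """Choose `needed` scratch registers, preferring dead ones.

    Returns (scratch, to_save): only live registers need a save and restore.
    """
    dead = [r for r in candidates if r not in live]
    scratch = dead[:needed]
    to_save = [r for r in candidates if r in live][: needed - len(scratch)]
    return scratch + to_save, to_save


# lw t0,0(a0); add t1,t0,a1; sw t1,0(a2)
block = [({"t0"}, {"a0"}), ({"t1"}, {"t0", "a1"}), (set(), {"t1", "a2"})]
live = live_before_each(block, live_out={"a0"})
candidates = ["t0", "t1", "t2", "t3"]
print(pick_scratch(live[2], candidates, 2))  # (['t0', 't2'], [])
print(pick_scratch(live[2], candidates, 4))  # (['t0', 't2', 't3', 't1'], ['t1'])
```

Before the store, two scratch registers cost nothing because enough candidates are dead, while asking for four forces one save and restore. A tool without liveness information would have to save every register it uses.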

4.3  Applications 

The  measurements  of  individual  operations  presented  above  suggest  that  adaptable  binaries  can  increase  the 
efficiency of transformed binary programs. In this section, we present case studies of two binary
transformation applications: instruction profiling and memory system simulation. For instruction profiling, we used
the commercial program pixie. For memory system simulation, we used the research system MemSpy.

4.3.1  Pixie

Pixie transforms a binary program to count basic blocks. It then uses the count of basic blocks to compute
instruction frequencies, wasted cycles due to floating point pipeline interlocks, and other instruction profiling
information. The transformed program simply counts basic blocks and outputs the counts to a file for
post-processing.

We measured the transformation overhead for basic-block counting of the original pixie and a version that
takes advantage of AB information. The AB version is both faster and more accurate than the original. It is
more accurate because it can identify all possible targets of indirect control transfers and therefore all basic
block boundaries. Pixie must rely on heuristics to guess basic block boundaries.
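The dependence of basic block boundaries on indirect targets can be shown with a small sketch. It uses the textbook leader rule (the entry point, every transfer target, and every instruction following a transfer start a block), ignores MIPS branch delay slots for simplicity, and uses hypothetical instruction indices.

```python
def basic_block_leaders(n_instructions, direct_targets, indirect_targets):
    """Return sorted indices of instructions that start a basic block.

    direct_targets: maps the index of each control transfer to its statically
                    known targets (empty list for an indirect transfer).
    indirect_targets: indices reachable only through indirect transfers,
                      as supplied by control information.
    """
    leaders = {0}  # program entry
    for site, targets in direct_targets.items():
        leaders.update(targets)
        if site + 1 < n_instructions:
            leaders.add(site + 1)  # fall-through after a transfer
    leaders.update(indirect_targets)
    return sorted(leaders)


# Instruction 3 branches to 7; instruction 8 is an indirect jump.
transfers = {3: [7], 8: []}
print(basic_block_leaders(10, transfers, indirect_targets={5}))    # [0, 4, 5, 7, 9]
print(basic_block_leaders(10, transfers, indirect_targets=set()))  # [0, 4, 7, 9]
```

Without the indirect target at index 5, the block starting at index 4 silently absorbs a second entry point, so a counter placed at its head undercounts the instructions that follow the missed boundary.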

In addition to using a form of dynamic relocation to support editing operations, the original pixie
implementation obtains two scratch registers and one reserved register. Table 4 gives the transformation overhead
for the original and AB versions.

For our benchmark programs, pixie approximately doubles the end-to-end running time of the program. Of
this overhead, 110.6% is due to pixie's transformation overhead, and 40.4% to maintaining counters for
each basic block. Thus, adaptable binaries reduce pixie's end-to-end overhead by approximately a factor
of four. For applications that use control flow analysis to optimize the placement of counters, such as QPT,
adaptable binaries will provide even larger relative savings.
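The factor of roughly four can be checked from the two figures quoted above. The sketch assumes, as an upper bound, that AB information removes the transformation overhead entirely and leaves only the cost of maintaining counters; the measured AB figures are those of Table 4, which this calculation does not replace.

```python
def overhead_reduction(transformation_pct, instrumentation_pct):
    """Ratio of total overhead to the overhead left once the
    transformation component is removed (an upper-bound assumption)."""
    total = transformation_pct + instrumentation_pct
    return total / instrumentation_pct


factor = overhead_reduction(110.6, 40.4)
print(round(factor, 2))  # 3.74
```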
4.3.2  MemSpy


MemSpy measures memory system performance. It inserts instrumentation code before every load and store
instruction in the original program, as well as all procedure entry and exit points. Currently, MemSpy
uses the Tango simulation system to modify the assembly language version of a program. In doing so, it
avoids the instruction insertion overhead incurred by pixie. However, Section 5 describes some important
disadvantages of this choice.

Because the simulation code is sufficiently complex, the MemSpy designers chose to insert call instructions
before each load and store, rather than to duplicate the simulation code inline. In addition, the simulation
code is written in C, and thus assumes that caller-save registers are available for use.

Since MemSpy does not currently have accurate register liveness information, it must save and restore all
the caller-save registers. Using the register information available in adaptable binaries, we instrumented
our benchmark programs to save only those caller-save registers that were live at each call to the MemSpy
simulation routines. Table 4 shows that this use of AB information reduces the transformation overhead
due to register operations of MemSpy by more than a factor of 39. MemSpy's authors report that register
operations account for approximately 60% of MemSpy's total simulation overhead [MGA92], and hence the
use of adaptable binaries can eliminate the majority of MemSpy's end-to-end overhead.
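The step from these two figures to "the majority" of the overhead is an Amdahl-style calculation. The sketch below takes the 60% share and the factor of 39 exactly as quoted above and assumes the remaining 40% of the overhead is unaffected.

```python
def remaining_fraction(affected_fraction, reduction_factor):
    """Fraction of the original overhead left after one component
    is reduced by `reduction_factor` and the rest is unchanged."""
    return (1.0 - affected_fraction) + affected_fraction / reduction_factor


left = remaining_fraction(0.60, 39)
print(f"{left:.1%} of the overhead remains; {1 - left:.1%} is eliminated")
```

Under these assumptions a little under three fifths of the total simulation overhead is removed, which is consistent with the claim in the text.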


5  Related  Work 

A  number  of  systems  have  added  information  to  binaries  to  facilitate  transformation  systems.  Johnson 
modified the Stardent linker ld to retain inter-file relocation information [Joh90]. In addition, Stardent
compilers  were  restricted  from  performing  certain  machine-dependent  optimizations  and  placing  data  in  the 
code  segment.  In  contrast,  adaptable  binaries  place  no  constraints  on  compilers.  Further,  the  Stardent 
system  did  not  consider  control,  intra-file  relocation,  or  register  information. 

Like Johnson's system, epoxie retains inter-file relocation information [Wal91]. Rather than modify the
linker, epoxie can use any linker that supports incremental linking. Incremental linking (e.g., ld -r)
retains relocation information, allowing already combined object files to be repeatedly linked.

The Mahler system performs transformations in the linker, providing transformation applications with
inter-file relocation information [Wal92]. Like Stardent's ld and epoxie, no control or intra-file relocation information is
included  in  the  binary.  Mahler  performs  several  optimizations,  including  global  register  allocation.  To 
accomplish  this,  register  actions  are  included  which  tell  the  linker  how  to  modify  the  binary  if  it  decides  to 
promote  a  variable  from  memory  to  a  register.  Register  actions  provide,  however,  only  limited  support  for 
determining  live  register  information. 

In  addition,  systems  have  used  compiler-dependent  heuristics  to  approximate  the  control,  relocation  and 
register information present in adaptable binaries [Wal91, GH90, LB92, SW92]. Control and relocation
information is synthesized by pattern matching for compiler-dependent instruction idioms. For example, to find
jump tables, MTool searches for the instruction sequence it assumes will implement case statements [GH90].
Larus  and  Ball  exploit  register  usage  conventions  when  allocating  registers,  relying  on  programmer  input  to 
discover  cases  in  which  these  assumptions  are  invalid  [LB92]. 

Relying  on  compiler-dependent  heuristics  reduces  the  scope  and  effectiveness  of  a  binary  transformation 
application.  For  highly  optimized  code,  finding  a  reliable  heuristic  might  be  difficult  or  impossible.  Programs 
frequently  use  libraries  written  in  different  high-level  languages  or  generated  by  different  compilers.  In  these 
cases,  different  and  potentially  incompatible  heuristics  might  be  necessary.  Because  there  are  no  constraints 
on the compiler, it is impossible to ensure that the heuristics cover all possible sequences of generated code.
For example, the Ultrix C compiler will violate normal MIPS register usage conventions when asked to
do global register allocation [Digb]. Finally, a binary transformation tool that relies on compiler-dependent
heuristics  must  be  updated  and  tested  with  each  new  release  of  the  compiler. 




Adaptable  binaries  avoid  these  difficulties  by  including  extra  information  in  the  executable  representation  of 
a  program.  Once  annotated  with  this  information,  a  binary  program  or  library  can  support  a  large  range  of 
binary  transformation  applications,  regardless  of  its  origin. 

Finally, researchers have performed transformations on higher level program representations. For example,
modifying assembly or object files addresses the problem of deriving relocation information. It does not
address the need for control and register usage information.

Further, instrumenting a program at higher levels of abstraction poses the significant problem that many
classes of program events are influenced by compilation decisions not fully resolved until the binary is created.
For  example,  instrumenting  a  compiler’s  intermediate  form  can  significantly  affect  the  code  generated.  On 
many systems, the assembly representation contains high-level pseudo instructions, hiding machine
instruction selection. On MIPS systems, the assembler performs machine instruction selection, delay slot filling, and
instruction scheduling [Dig^. The relative order of procedures as well as final addresses are not established
until the object files are processed by the linker. As mentioned above, the Titan system linker performs
inter-procedural optimizations [Wal92].


6  Conclusion 

We  have  described  adaptable  binaries,  a  technique  for  supporting  robust  and  efficient  binary  transformation. 
First,  we  identified  a  set  of  operations  that  are  fundamental  to  binary  transformation.  At  least  one  of  these 
operations  is  necessary  for  every  binary  transformation  application  that  we  surveyed.  Second,  we  defined  the 
minimum information required for efficient and robust support of these operations. We detailed the subset
of this information that cannot be derived from current binary programs and demonstrated that this subset
of  information  can  be  added  to  executable  files  with  negligible  space  overhead.  Finally,  we  demonstrated 
quantitatively  that  augmenting  binary  programs  with  this  information  can  dramatically  reduce  the  run-time 
overhead  of  instrumented  programs. 

Adaptable  binaries  establish  the  necessary  and  sufficient  information  that  any  compiler  must  provide  to 
support  the  basic  binary  transformation  operations.  Once  annotated  with  this  information,  a  binary  program 
or  library  can  support  a  large  range  of  binary  transformation  applications,  regardless  of  its  origin.  We  hope 
that  this  work  will  influence  compiler-writers  and  maintainers  to  make  the  output  of  adaptable  binary 
information  a  standard  compiler  option. 
A  Figures


original program          out-of-line patch


Figure 3: Two indirect control transfer instructions. The first jump requires only inter-file relocation
information. Because the second jump's target address is computed using an intra-file offset (-16) relative to a
relocated label (codeLabel), both inter-file and intra-file relocation information is required.
B  Performance Data


Benchmark             Adaptable Binary   Standard       Standard Header +
                      Information        Header         Debug Symbols

052.alvinn     FP     5%                 95%            100%
026.compress   INT    8%                 94%            107%
056.ear        FP     [illegible]        82%            107%
023.eqntott    INT    9%                 91%            131%
008.espresso   INT    9%                 [illegible]    151%
001.gcc1.35    INT    [illegible]        92%            124%
022.li         INT    [illegible]        [illegible]    181%
072.sc         INT    [illegible]        50%            81%

Average               [illegible]        [illegible]    123%

Table 1: Disk space overhead for adaptable binary information. For context, we also present the space
overhead for the standard header included in all unstripped programs, as well as the symbols included when
programs are compiled for debugging.


Benchmark             DYN-RELOC      OUT-OF-LINE

052.alvinn     FP     [illegible]    112.2%
026.compress   INT    [illegible]    [illegible]
056.ear        FP     28.4%          98.8%
023.eqntott    INT    [illegible]    [illegible]
008.espresso   INT    [illegible]    [illegible]
001.gcc1.35    INT    [illegible]    [illegible]
022.li         INT    48.5%          170.1%
072.sc         INT    [illegible]    71.3%

Average               34.3%          112.5%

Table 2: Transformation overhead, relative to native execution time, for dynamic relocation and out-of-line
insertion strategies for inserting instrumentation code before each load and store in the program.


Table 3: Transformation overhead, relative to native execution time, of obtaining the specified number of
scratch registers at each load and store instruction.


Table 4: Transformation overhead, relative to native execution time, for the applications pixie and MemSpy.